Abstract

Automatically describing video content with natural language
is a fundamental challenge of multimedia. Recurrent Neural
Networks (RNNs), which model sequence dynamics, have attracted
increasing attention for visual interpretation. However, most
existing approaches generate a word locally, given the previous
words and the visual content, while the relationship between
sentence semantics and visual content is not holistically
exploited. As a result, the generated sentences may be
contextually correct while the semantics (e.g., subjects, verbs
or objects) are wrong.

This paper presents a novel unified framework, named
Long Short-Term Memory with visual-semantic Embedding
(LSTM-E), which simultaneously explores the learning of LSTM
and visual-semantic embedding. The former aims to locally
maximize the probability of generating the next word given the
previous words and the visual content, while the latter creates
a visual-semantic embedding space for enforcing the relationship
between the semantics of the entire sentence and the visual
content. Our proposed LSTM-E consists of three components: a 2-D
and/or 3-D deep convolutional neural network for learning a
powerful video representation, a deep RNN for generating
sentences, and a joint embedding model for exploring the
relationships between visual content and sentence semantics.
Experiments on the YouTube2Text dataset show that LSTM-E
achieves the best performance reported to date in generating
natural sentences: 45.3% and 31.0% in terms of BLEU@4 and
METEOR, respectively. We also demonstrate that LSTM-E is
superior to several state-of-the-art techniques in predicting
Subject-Verb-Object (SVO) triplets.

1. Introduction 

Video has become ubiquitous on the Internet, broadcasting
channels, as well as personal devices. This has encouraged the
development of advanced techniques to analyze the semantic video
content for a wide variety of applications. Recognition of
videos has been a fundamental challenge of multimedia for
decades. Previous research has predominantly focused on
recognizing videos with a predefined yet very limited set of
individual words. Thanks to the recent development of Recurrent
Neural Networks (RNN), researchers have strived to automatically
describe video content with a complete and natural sentence,
which can be regarded as the ultimate goal of video
understanding.

Input Video:

Output Sentence:

• LSTM: a man is riding a horse.

• LSTM-E: a woman is riding a horse.

• Humans: a woman gallops on a horse. / a woman is riding a
horse along a road. / the girl rode her brown horse.

Figure 1. Examples of video description generation. Input: a short
video. Output: a natural language sentence describing the main
content of the input video.

Figure 1 shows examples of video description generation.
Given an input video, the generated sentences should describe
the video content, ideally encapsulating its most informative
dynamics. There is a wide variety of video applications based on
the description, ranging from editing and indexing to search and
sharing. However, the problem itself has been regarded as a
grand challenge for decades in the research communities, as the
description generation model should be powerful enough not only
to recognize key objects from visual content, but also to
discover their spatio-temporal relationships and express the
dynamics in natural language.

Despite the difficulty of the problem, there have been
a few attempts to address video description generation
[6, 32, 35] and image caption generation, which are mainly
inspired by recent advances in machine translation using
Recurrent Neural Networks (RNN). The standard RNN is a nonlinear
dynamical system that maps sequences to sequences. Although the
gradients of the RNN are easy to compute, RNN models are
difficult to train, especially when the problems have long-range
temporal dependencies, due to the well-known “vanishing
gradient” effect. As such, the Long Short-Term Memory (LSTM)
model was proposed to overcome the vanishing gradient problem by
incorporating memory units, which allow the network to learn
when to forget previous hidden states and when to update hidden
states. LSTM has been successfully adopted in several tasks,
e.g., speech recognition, language translation and image
captioning [20, 33]. Thus, we follow this elegant recipe and use
LSTM as our RNN model to generate the video sentence in this
paper.

Moreover, existing video description generation approaches
mainly optimize the next word locally, given the input video and
previous words, while leaving the relationship between the
semantics of the entire sentence and the video content
unexploited. As a result, the generated sentences can suffer
from a lack of robustness. It is often the case that the output
sentence from existing approaches is contextually correct but
the semantics (e.g., subjects, verbs or objects) in the sentence
are wrong. For example, the sentence generated by the LSTM-based
model for the video in Figure 1 is “a man is riding a horse,”
which is grammatically plausible, but the subject “man” does not
match the video content.

To address the above issues, we leverage the semantics
of the entire sentence and the visual content to learn a
visual-semantic embedding model, which holistically explores the
relationships between them. Specifically, we present a novel
Long Short-Term Memory with visual-semantic Embedding (LSTM-E)
framework to bridge video content and natural language, as shown
in Figure 2. Given a video, a 2-D and/or 3-D Convolutional
Neural Network (CNN) is utilized to extract visual features of
selected video frames/clips, and the video representation is
produced by mean pooling over these visual features. Then, an
LSTM for generating the video sentence and a visual-semantic
embedding model are jointly learnt based on the video
representation and sentence semantics. The spirit of LSTM-E is
to generate the video sentence from the viewpoint of mutual
reinforcement between coherence and relevance. Coherence
expresses the contextual relationships among the generated words
given the video content, which is optimized in the LSTM, while
relevance conveys the relationship between the semantics of the
entire sentence and the video content, which is measured in the
visual-semantic embedding.

In summary, this paper makes the following contributions:

• We present an end-to-end deep model for automatic
video description generation, which incorporates both
spatial and temporal structures underlying video.

• We propose a novel Long Short-Term Memory with
visual-semantic Embedding (LSTM-E) framework,
which considers both the contextual relationship
among the words in a sentence and the relationship
between the semantics of the entire sentence and the
video content, for generating a natural language
description of a given video.

• The proposed model is evaluated on the popular
YouTube2Text corpus and outperforms the state of the
art in terms of both Subject-Verb-Object (SVO)
triplet prediction and sentence generation.

The remaining parts of the paper are organized as follows.
Section 2 reviews related work. Section 3 presents the problem
of video description generation, while Section 4 details our
solution of jointly modeling embedding and translation. In
Section 5 we provide empirical evaluations, followed by the
discussions and conclusions in the final section.

2. Related Work

There are mainly two directions for translation from visual
content. The first direction predefines special rules for
language grammar and splits a sentence into several parts
(e.g., subject, verb, object). With such sentence fragments,
many works align each part with visual content and then
generate the sentence for the corresponding visual content. For
images, one line of work uses a Conditional Random Field (CRF)
model to produce sentences, and another proposes a Markov Random
Field (MRF) model to attach a descriptive sentence to a given
image. For video translation, Rohrbach et al. learn a CRF to
model the relationships between different components of the
input video and generate descriptions for the video. Guadarrama
et al. [11] use semantic hierarchies to choose an appropriate
level of specificity and accuracy for sentence fragments. This
direction depends heavily on sentence templates and can only
generate sentences with a fixed syntactic structure.

Another direction is to learn the probability distribution
in the common space of visual content and textual sentences.
In this direction, several works explore such a probability
distribution using topic models [3, 13] and neural networks
[20, 31, 32, 33, 35], which can generate sentences more
flexibly. Most recently, several methods have been proposed for
the visual-to-sentence task based on neural networks, and most
of them use RNNs due to their successful use in
sequence-to-sequence learning for machine translation. Kiros et
al. were the first to use neural networks to generate sentences
for images, proposing an image-text multimodal log-bilinear
neural language model. In another work, Mao et al. [20] propose
a multimodal Recurrent Neural Network (m-RNN) model for image
captioning, which directly models the probability of generating
a word given the previous words and the image. Vinyals et al.
[33] propose an end-to-end neural network system that uses LSTM
to generate sentences for images. For video translation, an
end-to-end LSTM-based model is proposed in [32], which only
reads the sequence of video frames and then generates a natural
sentence. The model was later extended to take both frames and
optical flow as input. Yao et al. [35] propose to use a 3-D
convolutional neural network for modeling the dynamic temporal
structure of video clips and an attention mechanism to select
the most relevant temporal clips. The resulting video
representations are then fed into the text-generating RNN.

Figure 2. An overview of our LSTM-E framework with a language generating LSTM and a visual-semantic embedding model (better
viewed in color). The video representation is produced by mean pooling over the visual features of frames/clips, extracted by a 2-D/3-D
CNN. The relevance loss measures the relationship between the semantics of the entire sentence and the video content in the embedding
space, while the coherence loss characterizes the contextual relationships among the generated words in the sentence in the LSTM. Both
the LSTM and the visual-semantic embedding are jointly learnt by minimizing the two losses.

Our work belongs to the second direction. However,
most of the above approaches in this direction mainly focus
on optimizing the contextual relationship among words to
generate a sentence given the visual content, while the
relationship between the semantics of the entire sentence and
the visual content is not fully explored. Our work differs in
that we generate the video sentence by jointly exploiting the
two relationships, which characterize the complementary
properties of coherence and relevance of a generated sentence,
respectively.

3. Video Description Generation 

Our goal is to generate language sentences for videos.
What makes a good sentence? Beyond describing important
persons, objects, scenes, and actions by words, it must also
convey how one word leads to the next. Specifically, we define
a good sentence as a coherent chain of words in which each word
influences the next through contextual information.
Furthermore, the semantics of the entire sentence must be
relevant to the video content. We begin this section by
presenting the problem formulation, followed by two losses that
measure coherence and relevance, respectively.

3.1. Problem Formulation 

Suppose we have a video V with sampled frames/clips
(uniform sampling) to be described by a textual sentence S,
where $S = \{w_1, w_2, \ldots, w_{N_s}\}$ consists of $N_s$ words.
Let $\mathbf{v} \in \mathbb{R}^{D_v}$ and $\mathbf{w}_t \in \mathbb{R}^{D_w}$
denote the $D_v$-dimensional visual features of video V and the
$D_w$-dimensional textual features of the t-th word in sentence
S, respectively. As a sentence consists of a sequence of words,
a sentence can be represented by a $D_w \times N_s$ matrix
$\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{N_s}]$,
with each word in the sentence as a column vector. Furthermore,
we denote by $\mathbf{s}$ another feature vector in the text
space that represents the sentence as a whole.

In the video description generation problem, on one
hand, the generated descriptive sentence must be able to
depict the main contents of a video precisely, and on the
other, the words in the sentence should be organized coherently
in language. Therefore, we can formulate the video description
generation problem as minimizing the following energy loss
function

$$E(V, S) = (1 - \lambda) \times E_r(\mathbf{v}, \mathbf{s}) + \lambda \times E_c(\mathbf{v}, \mathbf{W}), \qquad (1)$$

where $E_r(\mathbf{v}, \mathbf{s})$ and $E_c(\mathbf{v}, \mathbf{W})$ are the
relevance loss and coherence loss, respectively. The former
measures the degree of relevance between the video content and
the sentence semantics, and we build a visual-semantic embedding
for this purpose, which is introduced in Section 3.2. The latter
estimates the contextual relationships among the generated
words in the sentence, and we use an LSTM-based RNN as our
model, which is presented in Section 3.3. The tradeoff between
these two competing losses is captured by linear fusion with a
positive parameter $\lambda$.

3.2. Visual-Semantic Embedding 

In order to effectively represent the visual content of a
video, we first use a 2-D and/or 3-D deep convolutional
neural network (CNN), which is powerful enough to produce a
rich representation of each sampled frame/clip from the
video. Then, we perform “mean pooling” over all the
frames/clips to generate a single $D_v$-dimensional vector
$\mathbf{v}$ for each video V. The sentence feature $\mathbf{s}$ is
produced from the feature vectors $\mathbf{w}_t$ ($t = 1, 2, \ldots, N_s$)
of the words in the sentence. We first encode each word $w_t$ as
a “one-hot” vector (a binary index vector over the vocabulary),
so the dimension of the feature vector $\mathbf{w}_t$, i.e., $D_w$,
is the vocabulary size. Then binary TF weights are calculated
over all words of the sentence to produce the integrated
representation of the entire sentence, denoted by
$\mathbf{s} \in \mathbb{R}^{D_w}$, with the same dimension as
$\mathbf{w}_t$.
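As a minimal sketch of the word and sentence encodings described above, the following Python snippet builds one-hot word vectors and a binary TF sentence vector. The toy vocabulary and the helper names `one_hot` and `binary_tf` are illustrative assumptions, not part of the paper.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def binary_tf(sentence, vocab):
    """Binary TF vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    present = set(sentence)
    return [1 if word in present else 0 for word in vocab]


# Hypothetical toy vocabulary (an assumption for illustration only).
vocab = ["<start>", "<end>", "a", "woman", "is", "riding", "horse"]
sentence = ["a", "woman", "is", "riding", "a", "horse"]

# One column of W per word; each vector has dimension D_w = len(vocab).
word_vectors = [one_hot(vocab.index(w), len(vocab)) for w in sentence]

# Sentence-level feature s, with the same dimension as each word vector.
s = binary_tf(sentence, vocab)
print(s)  # [0, 0, 1, 1, 1, 1, 1]
```

Because the weights are binary, the repeated word "a" contributes a single 1, so the sentence vector records which words occur but not how often.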

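We assume that a low-dimensional embedding exists for
the representation of video and sentence. The linear mapping
functions can be derived from this embedding by

$$\mathbf{v}_e = \mathbf{T}_v \mathbf{v} \quad \text{and} \quad \mathbf{s}_e = \mathbf{T}_s \mathbf{s}, \qquad (2)$$

where $D_e$ is the dimensionality of the embedding, and
$\mathbf{T}_v \in \mathbb{R}^{D_e \times D_v}$ and
$\mathbf{T}_s \in \mathbb{R}^{D_e \times D_w}$ are the transformation
matrices that project the video content and the semantic
sentence into the common embedding, respectively.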

To measure the relevance between the video content and the
semantic sentence, one natural way is to compute the distance
between their mappings in the embedding. Thus, we define the
relevance loss as

$$E_r(\mathbf{v}, \mathbf{s}) = \| \mathbf{T}_v \mathbf{v} - \mathbf{T}_s \mathbf{s} \|_2^2. \qquad (3)$$

We strengthen the relevance between the video content and the
semantic sentence by minimizing the relevance loss. As such,
the generated sentence is expected to better reflect the
semantics of the video.
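The relevance loss can be illustrated with a small numerical sketch. The matrices and vectors below are made-up toy values, the helper names are hypothetical, and the loss is assumed to be the squared Euclidean distance between the two embedded vectors.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def relevance_loss(T_v, v, T_s, s):
    """Squared Euclidean distance between the embedded video and sentence."""
    v_e = matvec(T_v, v)  # video mapped into the embedding space
    s_e = matvec(T_s, s)  # sentence mapped into the embedding space
    return sum((a - b) ** 2 for a, b in zip(v_e, s_e))


# Toy dimensions (assumptions): D_e = 2, D_v = 3, D_w = 4.
T_v = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
T_s = [[0.5, 0.0, 0.0, 0.0],
       [0.0, 0.5, 0.0, 0.0]]
v = [1.0, 2.0, 3.0]   # mean-pooled visual feature
s = [1, 1, 0, 1]      # binary TF sentence feature

loss = relevance_loss(T_v, v, T_s, s)
print(loss)  # 2.5
```

Here the embedded video is (1, 2) and the embedded sentence is (0.5, 0.5), so the loss is 0.25 + 2.25 = 2.5. Training would adjust both projection matrices to shrink this distance for matching video-sentence pairs.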

3.3. Translation by Sequence Learning 

Inspired by the recent successes of probabilistic sequence
models in statistical machine translation, we define our
coherence loss as

$$E_c(\mathbf{v}, \mathbf{W}) = -\log \Pr(\mathbf{W} \mid \mathbf{v}). \qquad (4)$$

Assuming a generative model of $\mathbf{W}$ that produces
each word in the sequence in order, the log probability of
the sentence is given by the sum of the log probabilities
over the words and can be expressed as:

$$\log \Pr(\mathbf{W} \mid \mathbf{v}) = \sum_{t=0}^{N_s} \log \Pr(\mathbf{w}_t \mid \mathbf{v}, \mathbf{w}_0, \ldots, \mathbf{w}_{t-1}). \qquad (5)$$

By minimizing the coherence loss, the contextual relationship
among the words in the sentence is encouraged, making the
sentence coherent and smooth.
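A short sketch of the coherence loss as a negative sum of per-word log probabilities follows. The per-step distributions are invented toy values standing in for the model's conditional probabilities, and `coherence_loss` is a hypothetical helper name.

```python
import math


def coherence_loss(step_probs, word_ids):
    """Negative log-likelihood of a word sequence.

    step_probs[t] is the predicted distribution over the vocabulary at step t,
    and word_ids[t] is the index of the ground-truth word at that step.
    """
    return -sum(math.log(probs[w]) for probs, w in zip(step_probs, word_ids))


# Toy distributions over a 3-word vocabulary (assumed values).
step_probs = [
    [0.50, 0.25, 0.25],
    [0.10, 0.80, 0.10],
]
word_ids = [0, 1]

loss = coherence_loss(step_probs, word_ids)
print(round(loss, 4))  # 0.9163
```

The value equals -(log 0.5 + log 0.8) = -log 0.4, which shows how the chain-rule factorization turns the sentence probability into a sum over time steps.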

In the video description generation task, both the relevance
loss and the coherence loss need to be estimated to complete the
whole energy function. We present a solution that jointly
models the two losses in a deep recurrent neural network in
the next section.

4. Joint Modeling Embedding and Translation 

Following the relevance and coherence criteria, this
work proposes a Long Short-Term Memory with visual-semantic
Embedding (LSTM-E) model for video description generation. The
basic idea of LSTM-E is to translate the video representation
from a 2-D and/or 3-D deep convolutional network into the
desired output sentence by using an LSTM-type RNN model.
Figure 2 shows an overview of the LSTM-E model. In particular,
the training of LSTM-E is performed by simultaneously minimizing
the relevance loss and the coherence loss. Therefore, the
formulation presented in Eq. (1) is equivalent to minimizing the
following energy function

$$E(V, S) = (1 - \lambda) \times \| \mathbf{T}_v \mathbf{v} - \mathbf{T}_s \mathbf{s} \|_2^2 - \lambda \times \sum_{t=0}^{N_s} \log \Pr(\mathbf{w}_t \mid \mathbf{v}, \mathbf{w}_0, \ldots, \mathbf{w}_{t-1}; \theta; \mathbf{T}_v; \mathbf{T}_s), \qquad (6)$$

where $\theta$ are the parameters of our LSTM-E model.

In the following, we will first present the architecture 
of LSTM memory cell, followed by jointly modeling with 
visual-semantic embedding. 

4.1. Long Short Term Memory 

We briefly introduce the standard Long Short-Term
Memory (LSTM), a variant of RNN, which can capture long-term
temporal information by mapping input sequences to a sequence
of hidden states and then hidden states to outputs. To address
the vanishing gradient problem in traditional RNN training,
LSTM incorporates a memory cell which can maintain its state
over time and non-linear gating units which control the
information flow into and out of the cell. As LSTM has attracted
much attention recently, many improvements have been made to
its original formulation. We adopt the LSTM architecture
described in [36], which omits the peephole connections used in
previous work.


A diagram of the LSTM unit can be seen in Figure 3. It
consists of a single memory cell, an input activation function,
an output activation function, and three gates (input, forget
and output). The hidden state of the cell is recurrently
connected back to the input and the three gates. The memory
cell updates its state by combining the previous cell state,
modulated by the forget gate, with a function of the current
input and the previous output, modulated by the input gate. The
forget gate is a critical component of the LSTM unit: it
controls what is remembered and what is forgotten by the cell,
and helps keep the gradient from vanishing or exploding when
back-propagating through time. Having been updated, the cell
state is mapped to the (-1, 1) range through an output
activation function, which is necessary whenever the cell state
is unbounded. Finally, the output gate determines how much of
the memory cell flows into the output. These additions to the
single memory cell enable LSTM to capture complex, long-term
temporal dynamics that are difficult for a traditional RNN to
model.

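The vector formulas for an LSTM layer forward pass are
given below. For timestep t, $\mathbf{x}^t$ and $\mathbf{h}^t$ are the
input and output vectors respectively, $\mathbf{T}$ are input weight
matrices, $\mathbf{R}$ are recurrent weight matrices and $\mathbf{b}$
are bias vectors. The logistic sigmoid
$\sigma(x) = \frac{1}{1 + e^{-x}}$ and hyperbolic tangent
$\phi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ are element-wise
non-linear activation functions, mapping real numbers to (0, 1)
and (-1, 1), respectively. The element-wise product and sum of
two vectors are denoted with $\odot$ and $\oplus$, respectively.
Given inputs $\mathbf{x}^t$, $\mathbf{h}^{t-1}$ and $\mathbf{c}^{t-1}$,
the LSTM unit updates for timestep t are:

$$
\begin{aligned}
\mathbf{g}^t &= \phi(\mathbf{T}_g \mathbf{x}^t + \mathbf{R}_g \mathbf{h}^{t-1} + \mathbf{b}_g) && \text{(cell input)} \\
\mathbf{i}^t &= \sigma(\mathbf{T}_i \mathbf{x}^t + \mathbf{R}_i \mathbf{h}^{t-1} + \mathbf{b}_i) && \text{(input gate)} \\
\mathbf{f}^t &= \sigma(\mathbf{T}_f \mathbf{x}^t + \mathbf{R}_f \mathbf{h}^{t-1} + \mathbf{b}_f) && \text{(forget gate)} \\
\mathbf{c}^t &= \mathbf{g}^t \odot \mathbf{i}^t + \mathbf{c}^{t-1} \odot \mathbf{f}^t && \text{(cell state)} \\
\mathbf{o}^t &= \sigma(\mathbf{T}_o \mathbf{x}^t + \mathbf{R}_o \mathbf{h}^{t-1} + \mathbf{b}_o) && \text{(output gate)} \\
\mathbf{h}^t &= \phi(\mathbf{c}^t) \odot \mathbf{o}^t && \text{(cell output)}
\end{aligned}
$$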
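The six update formulas can be traced step by step in code. The sketch below is a plain-Python illustration of one forward step without peephole connections; the function names, the dictionary layout of the parameters and the all-zero toy weights are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pre_activation(gate, x, h_prev):
    """Compute T x + R h_prev + b for one gate, where gate = (T, R, b)."""
    T, R, b = gate
    return [
        sum(t * xj for t, xj in zip(T_row, x))
        + sum(r * hj for r, hj in zip(R_row, h_prev))
        + b_k
        for T_row, R_row, b_k in zip(T, R, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM forward step following the six update formulas."""
    g = [math.tanh(a) for a in pre_activation(params["g"], x, h_prev)]  # cell input
    i = [sigmoid(a) for a in pre_activation(params["i"], x, h_prev)]    # input gate
    f = [sigmoid(a) for a in pre_activation(params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(a) for a in pre_activation(params["o"], x, h_prev)]    # output gate
    c = [gk * ik + ck * fk for gk, ik, ck, fk in zip(g, i, c_prev, f)]  # cell state
    h = [math.tanh(ck) * ok for ck, ok in zip(c, o)]                    # cell output
    return h, c


# Toy parameters (assumption): all weights and biases are zero, hidden size 2.
zero_gate = ([[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
params = {name: zero_gate for name in "gifo"}

h, c = lstm_step([1.0, -1.0], [0.0, 0.0], [2.0, -1.0], params)
print(c)  # [1.0, -0.5]
```

With zero weights every gate evaluates to 0.5 and the cell input to 0, so the new cell state is simply half of the previous one, which makes the role of the forget gate easy to see.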


4.2. LSTM with Visual-Semantic Embedding 

Figure 3. A diagram of an LSTM memory cell.

By further incorporating a visual-semantic embedding,
our LSTM-E architecture jointly models embedding and
translation. In the training stage, given a video-sentence
pair, the inputs of the LSTM are the representations of the
video and of the words in the sentence after mapping into the
embedding. As mentioned above, we train the LSTM model to
predict each word in the sentence given the embedding of the
visual feature of the video and the previous words. There are
multiple ways to combine the visual content and the words in the
LSTM unit updating procedure. The first is to feed the visual
content at each time step as an extra input to the LSTM, so
that the visual content is emphasized repeatedly among the LSTM
memory cells. The second inputs the visual content only once, at
the initial step, to inform the memory cells in the LSTM about
the visual content. As empirically verified in [33], feeding
the image at each time step yields inferior results, because the
network can explicitly exploit noise and overfits more easily.
Therefore, we adopt the second approach to arrange the inputs
into the LSTM in our architecture. Given the video $\mathbf{v}$ and
its corresponding sentence
$\mathbf{W} = [\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_{N_s}]$, the
LSTM updating procedure is as follows:

$$\mathbf{x}^{-1} = \mathbf{T}_v \mathbf{v}, \qquad (7)$$

$$\mathbf{x}^{t} = \mathbf{T}_s \mathbf{w}_t, \quad t \in \{0, \ldots, N_s - 1\}, \qquad (8)$$

$$\mathbf{h}^{t} = f(\mathbf{x}^{t}), \quad t \in \{0, \ldots, N_s - 1\}, \qquad (9)$$


where $f$ is the updating function within the LSTM unit. Note
that for the input sentence
$\mathbf{W} = \{\mathbf{w}_0, \ldots, \mathbf{w}_{N_s}\}$, we take
$\mathbf{w}_0$ as the start sign word that marks the beginning of
the sentence and $\mathbf{w}_{N_s}$ as the end sign word that marks
its end; both special sign words are included in our vocabulary.
More specifically, at the initial time step, the video
representation in the embedding is set as the input to the
LSTM, and in the following steps, the word embedding
$\mathbf{x}^t$ is input into the LSTM along with the previous
step’s hidden state $\mathbf{h}^{t-1}$. In each time step (except
the initial step), we use the LSTM cell output $\mathbf{h}^t$ to
predict the next word. A softmax layer is applied after the
LSTM layer to produce a probability distribution over all the
$D_w$ words in the vocabulary as

$$\mathrm{Pr}_{t+1}(w_{t+1}) = \frac{\exp\left(\mathbf{T}_h^{(w_{t+1})} \mathbf{h}^t\right)}{\sum_{w \in \mathcal{W}} \exp\left(\mathbf{T}_h^{(w)} \mathbf{h}^t\right)}, \qquad (10)$$

where $\mathcal{W}$ is the word vocabulary space and
$\mathbf{T}_h^{(w)}$ is the parameter matrix in the softmax layer.
Therefore, we can obtain the next word based on this
probability distribution until the end sign word is emitted.
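The softmax prediction step can be sketched as follows. The parameter matrix and hidden state are toy values, the function names are hypothetical, and the maximum score is subtracted before exponentiation purely for numerical stability, which does not change the resulting distribution.

```python
import math


def softmax(scores):
    """Normalize scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(T_h, h):
    """Score every vocabulary word against the LSTM output h, then normalize."""
    scores = [sum(t * hk for t, hk in zip(row, h)) for row in T_h]
    return softmax(scores)


# Toy softmax parameters (assumption): 3 vocabulary words, hidden size 2.
T_h = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]
h = [2.0, 0.0]

probs = next_word_distribution(T_h, h)
best = probs.index(max(probs))
print(best)  # 0
```

Each row of `T_h` plays the role of the softmax parameters for one vocabulary word, so the word whose row aligns best with the hidden state receives the highest probability.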

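Accordingly, we define our loss function as follows:

$$E(V, S) = (1 - \lambda) \times \| \mathbf{T}_v \mathbf{v} - \mathbf{T}_s \mathbf{s} \|_2^2 - \lambda \times \sum_{t=1}^{N_s} \log \mathrm{Pr}_t(\mathbf{w}_t). \qquad (11)$$

Let N denote the number of video-sentence pairs in the
training dataset. We then have the following optimization
problem:

$$\min_{\mathbf{T}_v, \mathbf{T}_s, \mathbf{T}_h, \theta} \; \frac{1}{N} \sum_{i=1}^{N} E\left(V^{(i)}, S^{(i)}\right) + \| \mathbf{T}_v \|_2^2 + \| \mathbf{T}_s \|_2^2 + \| \mathbf{T}_h \|_2^2 + \| \theta \|_2^2, \qquad (12)$$

where the first term is the combination of the relevance loss
and coherence loss, while the rest are regularization terms
for the video embedding, sentence embedding, softmax layer
and LSTM, respectively.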
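To make the structure of the objective concrete, the sketch below combines a relevance value and per-word log probabilities into the per-pair energy and then adds squared-norm regularizers. All numbers are invented, the function names are hypothetical, and the regularizers are added with unit weight, which is an assumption of this sketch.

```python
def pair_energy(relevance, word_log_probs, lam):
    """Per-pair energy: (1 - lam) * relevance loss - lam * sum of log probabilities."""
    return (1.0 - lam) * relevance - lam * sum(word_log_probs)


def squared_norm(matrix):
    """Sum of squared entries of a list-of-rows matrix."""
    return sum(value * value for row in matrix for value in row)


def objective(pair_energies, parameter_matrices):
    """Mean energy over training pairs plus squared-norm regularization terms."""
    data_term = sum(pair_energies) / len(pair_energies)
    reg_term = sum(squared_norm(m) for m in parameter_matrices)
    return data_term + reg_term


# Toy values (assumptions): relevance loss 2.0, two word log-probabilities.
energy = pair_energy(relevance=2.0, word_log_probs=[-0.5, -1.5], lam=0.7)

# Two training pairs and a single toy parameter matrix.
total = objective([energy, 4.0], [[[1.0, 2.0], [0.0, 0.0]]])
print(round(total, 6))  # 8.0
```

With a tradeoff of 0.7, the energy is 0.3 × 2.0 + 0.7 × 2.0 = 2.0; averaging it with the second pair gives 3.0, and the toy matrix contributes 1 + 4 = 5 to the regularizer.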

The above overall objective is optimized over all
training video-sentence pairs using stochastic gradient
descent. By minimizing this objective function, our LSTM-E
model takes into account both the contextual relationships
among the words in the sentence (coherence) and the
relationships between the semantics of the entire sentence and
the video content (relevance).

For sentence generation, there are two common strategies
to translate the given video. The first is to sample the next
word from the probability distribution at each timestep and
set its representation in the embedding space as the LSTM input
for the next timestep, until the end sign word is sampled or the
maximum sentence length is reached. The second is to keep the
top-k best partial sentences at each timestep and use them as
the candidates for the next timestep, from which the new top-k
best sentences are generated. To make the generation process
concise and efficient, we adopt the latter strategy but set k
to 1. Therefore, at each timestep, we choose the word with
maximum probability as the predicted word and input its
embedded feature at the next timestep, until the model outputs
the end sign word.
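The greedy strategy with k = 1 can be sketched as a simple loop. Here `toy_step` is a hypothetical stand-in for one LSTM-plus-softmax step driven by a fixed lookup table, and the initial step that feeds the video embedding is omitted for brevity; both are assumptions of this illustration.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Pick the most probable word at each step until the end sign or max_len.

    step_fn(word_id, state) must return (probabilities, new_state).
    """
    words = []
    word, state = start_id, None
    for _ in range(max_len):
        probs, state = step_fn(word, state)
        word = max(range(len(probs)), key=probs.__getitem__)  # argmax, i.e. k = 1
        if word == end_id:
            break
        words.append(word)
    return words


# Hypothetical toy model: next-word distributions indexed by the current word.
# Word ids (assumption): 0 = start sign, 1 = end sign, 2 = "a", 3 = "horse".
TABLE = {
    0: [0.0, 0.1, 0.8, 0.1],
    2: [0.0, 0.2, 0.1, 0.7],
    3: [0.0, 0.9, 0.05, 0.05],
}


def toy_step(word_id, state):
    """Return the table row for the current word; the state is unused here."""
    return TABLE[word_id], state


decoded = greedy_decode(toy_step, start_id=0, end_id=1, max_len=10)
print(decoded)  # [2, 3]
```

Setting k to 1 keeps only one hypothesis per step, which is cheaper than a wider beam but can miss sentences whose first words are individually less likely.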


5. Experiments 

In this section, we first introduce our experimental
setting. Then, the evaluation results compared with
state-of-the-art methods on two tasks, i.e., Subject-Verb-Object
(SVO) triplet prediction and natural sentence generation, are
reported. Finally, the effects of the tradeoff parameter between
coherence and relevance and of the size of the hidden layer in
LSTM are presented.

5.1. Experimental Setting 

We conduct our experiments mainly on the Microsoft
Research Video Description Corpus (YouTube2Text),
which has been used in several prior works [11, 13, 34] on
action recognition and video description generation tasks.
This video corpus contains 1,970 YouTube snippets which
cover a wide range of daily activities such as “people doing
exercises,” “playing music,” and “cooking.” We use the
roughly 40 available English descriptions per video. In our
experiments, following the setting used in prior works on
video description generation [11, 34], we pick 1,200 videos
as training data, 100 videos for validation and 670 videos for
testing.

We evaluate our LSTM-E architecture with two 2-D
CNNs, AlexNet and the 19-layer VGG network [25], both
pre-trained on the ImageNet ILSVRC12 dataset, and one 3-D CNN,
C3D, pre-trained on the Sports-1M video dataset. Specifically,
we take the output of the 4096-way fc7 layer from AlexNet, the
4096-way fc6 layer from the 19-layer VGG, and the 4096-way fc6
layer from C3D as the frame/clip representation, respectively.
The dimensionality of the visual-semantic embedding space and
the size of the hidden layer in LSTM are both set to 512. The
tradeoff parameter $\lambda$ balancing the relevance loss and
coherence loss is empirically set to 0.7. The sensitivity of
$\lambda$ is discussed in Section 5.3.

5.2. Performance Comparison 

We empirically verify the merit of our LSTM-E model 
from two aspects: SVO triplet prediction and sentence gen¬ 
eration for the video-language translation. 

5.2.1 Compared Approaches 

To fully evaluate our model, we compare our LSTM-E mod¬ 
els with the following non-trivial baseline methods. 

• Conditional Random Field (CRF) [34]: The CRF model is
developed to incorporate subject-verb and verb-object
pairwise relationships based on the word pairwise
co-occurrence statistics in the sentence pool.

• Canonical Correlation Analysis (CCA) [26]: CCA
builds the video-language joint space and generates
the SVO triplet by k-nearest-neighbors search in the
sentence pool.

• Factor Graph Model (FGM) [29]: FGM combines
knowledge mined from text corpora with visual
confidence using a factor graph and performs probabilistic
inference to determine the most likely SVO triplets.

• Joint Embedding Model (JEM) [34]: Proposed most
recently, JEM jointly models video and the corresponding
text sentences by minimizing the distance between the deep
video and compositional text representations in the joint
space.

• Long Short-Term Memory (LSTM): LSTM attempts to
directly translate from video pixels to natural language
with a single deep neural network. The video
representation is obtained by mean pooling over the features
of frames using AlexNet.

• Soft-Attention (SA) [35]: SA combines the frame
representation from GoogleNet and a video clip
representation based on a 3-D ConvNet trained on Histograms
of Oriented Gradients (HOG), Histograms of Optical Flow
(HOF), and Motion Boundary Histogram (MBH) hand-crafted
descriptors. Furthermore, a weighted attention mechanism is
used to dynamically attend to specific temporal regions of
the video while generating the sentence.

• Sequence to Sequence - Video to Text (S2VT):
S2VT incorporates both RGB and optical flow inputs,
and the encoding and decoding of the inputs and word
representations are learnt jointly in a parallel manner.

• Long Short-Term Memory with visual-semantic Embedding
(LSTM-E): We design four runs for our proposed approach,
i.e., LSTM-E (Alex), LSTM-E (VGG), LSTM-E (C3D), and
LSTM-E (VGG+C3D). The input frame/clip features of the first
three runs are from AlexNet, VGG and the C3D network,
respectively. The input of the last one is the concatenation
of the features from VGG and C3D.

5.2.2 SVO Triplet Prediction Task

Table 1. SVO accuracy: Binary accuracy of SVO triplet prediction.
We extract SVO triplets from sentences output by LSTM and
LSTM-E using a dependency parser.

Model               S%      V%      O%
FGM                 76.42   21.34   12.39
CRF                 77.16   22.54    9.25
CCA                 77.16   21.04   10.99
JEM                 78.25   24.45   11.95
LSTM                71.19   19.40    9.70
LSTM-E (Alex)       78.66   24.78   10.30
LSTM-E (VGG)        80.30   27.91   12.54
LSTM-E (C3D)        77.31   28.81   12.39
LSTM-E (VGG+C3D)    80.45   29.85   13.88


As SVO triplets can capture the compositional semantics of
videos, predicting the SVO triplet can indicate the quality of
a translation system to a large extent.

We adopt SVO accuracy, which measures the exactness of the
SVO words with a binary (0-1) loss, as the evaluation
metric. Table 1 details the SVO accuracy of the nine compared
models. Among these models, the first four (called item driven
models) explicitly optimize to identify the best subject, verb
and object items for a video, while the latter five (named
sentence driven models) focus on training on objects and
actions jointly in a sentence and learn to interpret these in
different contexts. For the latter five sentence driven models,
we extract the SVO triplets from the generated sentences with
the Stanford Parser¹, and the words are also stemmed. Overall,
the results across the SVO triplet indicate that almost all of
the four item driven models exhibit better performance than the
LSTM model, which predicts the next word by only considering the
contextual relationships with the previous words given the
video content. By jointly modeling the relevance between the
semantics of the entire sentence and the video content with
LSTM, LSTM-E significantly improves over LSTM. Furthermore, the
performances of LSTM-E (VGG), LSTM-E (C3D), and LSTM-E
(VGG+C3D) on Subject, Verb and Object are all above that of the
four item driven models. The result indicates the advantage of
further exploring the relevance holistically between the
semantics of the entire sentence and the video content in
addition to LSTM.
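The binary SVO accuracy can be illustrated with a short sketch that scores each slot separately. The triplets below are made up, `svo_accuracy` is a hypothetical helper, and comparing each prediction against a single reference triplet per video is an assumption, since this section does not spell out how the reference triplet is chosen.

```python
def svo_accuracy(predicted, reference):
    """Percentage of exact matches for the subject, verb and object slots."""
    correct = [0, 0, 0]
    for pred, ref in zip(predicted, reference):
        for slot in range(3):
            if pred[slot] == ref[slot]:  # 0-1 loss: exact (stemmed) word match
                correct[slot] += 1
    total = len(reference)
    return tuple(100.0 * count / total for count in correct)


# Toy (subject, verb, object) triplets, assumed to be stemmed already.
predicted = [
    ("man", "ride", "horse"),
    ("woman", "ride", "horse"),
    ("dog", "run", "park"),
    ("man", "cook", "food"),
]
reference = [
    ("woman", "ride", "horse"),
    ("woman", "ride", "horse"),
    ("dog", "play", "ball"),
    ("man", "cook", "rice"),
]

s_acc, v_acc, o_acc = svo_accuracy(predicted, reference)
print(s_acc, v_acc, o_acc)  # 75.0 75.0 50.0
```

Scoring the three slots independently mirrors the S%, V% and O% columns of Table 1 and makes it visible when a model names the right action but the wrong subject.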

Compared to LSTM-E (Alex), LSTM-E (VGG), which uses a
more powerful frame representation from a deeper CNN,
exhibits significantly better performance. In addition,
LSTM-E (C3D), which is better at encapsulating temporal
information, leads to better performance than LSTM-E (VGG) in
terms of Verb prediction accuracy. When the features from VGG
and C3D are combined, LSTM-E (VGG+C3D) further increases the
performance gains.

5.2.3 Sentence Generation Task 

Eor item driven models including EGM, CRE, CCA and 
JEM, the sentence generation is often performed by lever¬ 
aging a series of simple sentence templates (or special lan¬ 
guage trees) on the SVO triplets 13^ . Having verified in 
(321, using LSTM architecture can lead to a large perfor¬ 
mance boost against the template-based sentence genera¬ 
tion. Thus, Table only shows comparisons of LSTM- 
based sentence generations. We use the BLEU@A^ (2^ 
and METEOR scores d against all ground truth sentences. 
http://nlp.stanford.edu/software/lex-parser.shtml

Table 2. BLEU@N and METEOR scores for comparing the quality of the sentence generation. All values are reported as percentage (%).

Model              METEOR  BLEU@1  BLEU@2  BLEU@3  BLEU@4
LSTM               26.9    69.8    53.3    42.1    31.2
SA                 29.6    80.0    64.7    52.6    42.2
S2VT               29.8    -       -       -       -
LSTM-E (Alex)      28.3    74.5    59.8    49.3    38.9
LSTM-E (VGG)       29.5    74.9    60.9    50.6    40.2
LSTM-E (C3D)       29.9    75.7    62.3    52      41.7
LSTM-E (VGG+C3D)   31.0    78.8    66.0    55.4    45.3

Figure 4. The effect of the tradeoff parameter λ measured by
BLEU@N and METEOR.

Both metrics have been shown to correlate well with human
judgement and are widely used in the machine translation
literature. Specifically, BLEU@N measures the fraction of
N-grams (up to 4-gram) that are in common between a hypothesis
and a reference or set of references, while METEOR
computes unigram precision and recall, extending exact
word matches to include similar words based on WordNet
synonyms and stemmed tokens. As shown in Table 2,
the quantitative results across different N of BLEU and METEOR
consistently indicate that LSTM-E (Alex) significantly
outperforms the traditional LSTM model. Moreover,
the performance gain on BLEU@N becomes larger as N
increases, where N measures the length of the contiguous
sequence in the sentence. This again confirms that LSTM-E
benefits from holistically exploring the relationships between
the semantics of the entire sentence and video content by
minimizing the distance of their mappings in a visual-semantic
embedding. Similar to the observations in the SVO prediction
task, our LSTM-E (VGG) outperforms LSTM-E (Alex) and
reaches 29.5% METEOR. Furthermore, LSTM-E (C3D)
achieves 29.9% METEOR, and the performance improves
to 31.0% when it is combined with VGG, a relative
improvement over the current two state-of-the-art methods
of 4.7% for SA and 4.0% for S2VT.
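The gains discussed in this paragraph can be rechecked from the Table 2 values. The following is a minimal sketch and not part of the original experiments: the function names are illustrative assumptions, and it assumes that the 4.7% and 4.0% figures are relative METEOR improvements, which is what the arithmetic suggests.

```python
# Scores copied from Table 2 (percent). BLEU lists are ordered BLEU@1..BLEU@4.
bleu = {
    "LSTM": [69.8, 53.3, 42.1, 31.2],
    "LSTM-E (Alex)": [74.5, 59.8, 49.3, 38.9],
}
meteor = {"SA": 29.6, "S2VT": 29.8, "LSTM-E (VGG+C3D)": 31.0}


def absolute_gains(baseline, improved):
    """Per-N absolute gain in percentage points (hypothetical helper)."""
    return [round(new - old, 1) for old, new in zip(baseline, improved)]


def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent (hypothetical helper)."""
    return round(100.0 * (improved - baseline) / baseline, 1)


bleu_gains = absolute_gains(bleu["LSTM"], bleu["LSTM-E (Alex)"])
gain_over_sa = relative_gain(meteor["SA"], meteor["LSTM-E (VGG+C3D)"])
gain_over_s2vt = relative_gain(meteor["S2VT"], meteor["LSTM-E (VGG+C3D)"])

print("BLEU@1..4 gains of LSTM-E (Alex) over LSTM:", bleu_gains)
print("Relative METEOR gain over SA (%):", gain_over_sa)
print("Relative METEOR gain over S2VT (%):", gain_over_s2vt)
```

Under these assumptions the BLEU gain grows monotonically from BLEU@1 to BLEU@4, in line with the observation that the gain becomes larger as N increases.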

Figure 5 shows a few sentence examples generated by
different methods and human-annotated ground truth. From
these exemplar results, it is easy to see that all of these
automatic methods can generate somewhat relevant sentences.
When looking into each word, both LSTM-E (Alex) and
LSTM-E (VGG+C3D) predict more relevant Subject, Verb
and Object (SVO) terms. For example, compared to the subject
term “a man”, “people” or “a group of people” is more precise
in describing the video content in the second video. Similarly,
the verb term “singing” presents the fourth video more exactly.
The predicted object terms “keyboard” and “motorcycle”
are more relevant than “guitar” and “car” in the fifth and sixth
videos, respectively. Moreover, LSTM-E (VGG+C3D) can
offer more coherent sentences. For instance, the generated
sentence “a man is talking on a phone” of the third video
encapsulates the video content more clearly.

Table 3. The effect of hidden layer size in our LSTM-E
(VGG+C3D) framework measured by BLEU@4 and METEOR.

Hidden layer size   BLEU@4   METEOR   Parameter number
128                 38.4     29.0     3.6M
256                 40.6     29.6     7.5M
512                 45.3     31.0     16.0M

5.3. Experimental Analysis 

We further analyze the effect of the tradeoff parameter
between the two losses and the size of the hidden layer
in LSTM learning.


5.3.1 The Tradeoff Parameter λ


To clarify the effect of the tradeoff parameter λ in Eq.(pT]),
we illustrate the performance curves with different tradeoff
parameters in Figure 4. To make all performance curves
fall into a comparable scale, all BLEU@N and METEOR
values are normalized as follows


m'_{\lambda} = \frac{m_{\lambda} - \min_{\lambda} \{ m_{\lambda} \}}{\min_{\lambda} \{ m_{\lambda} \}}    (13)


where m_λ and m'_λ denote the original and normalized
performance values (BLEU@N or METEOR) over a set of λ values,
respectively.
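As a minimal sketch of the normalization in Eq. (13), the snippet below rescales a list of scores, one per value of the tradeoff parameter, relative to their minimum. The function name and the example scores are hypothetical placeholders and are not values read from Figure 4.

```python
def normalize_scores(scores):
    """Apply Eq. (13): subtract the minimum score, then divide by that minimum."""
    lowest = min(scores)
    if lowest == 0:
        # The normalization is undefined when the minimum score is zero.
        raise ValueError("minimum score must be non-zero")
    return [(score - lowest) / lowest for score in scores]


# Hypothetical metric values for three settings of the tradeoff parameter.
example_scores = [40.0, 44.0, 42.0]
normalized = normalize_scores(example_scores)
print(normalized)
```

After this rescaling the lowest score maps to zero and every other score is expressed as a fractional gain over it, which is why curves for metrics with different ranges become comparable on one plot.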

From the figure, we can see that all performance curves
have a “∧” shape when λ varies in a range from 0.1 to
0.9. The best performance is achieved when λ is about 0.7.
This indicates that it is reasonable to jointly learn the
visual-semantic embedding space in the deep recurrent neural
networks.






















Video 1
LSTM: a cat is playing with a mirror
LSTM-E (Alex): a cat is playing with a watermelon
LSTM-E (VGG+C3D): a kitten is playing with a toy
Ground Truth:
1) a kitten is playing with his toy
2) a cat is playing on the floor
3) a kitten plays with a toy

Video 2
LSTM: a man is dancing
LSTM-E (Alex): people are dancing
LSTM-E (VGG+C3D): a group of people are dancing
Ground Truth:
1) a group of people are dancing
2) people are dancing outside
3) many people dance in the street

Video 3
LSTM: a woman is talking
LSTM-E (Alex): a man is talking
LSTM-E (VGG+C3D): a man is talking on a phone
Ground Truth:
1) a man is talking on a cell phone
2) a man is speaking into a cell phone
3) the man talked on the phone

Video 4
LSTM: a man is playing a flute
LSTM-E (Alex): a man is singing
LSTM-E (VGG+C3D): a man is singing
Ground Truth:
1) a man is singing on stage
2) a man is singing into a microphone
3) a man sings into a microphone

Video 5
LSTM: a man is playing a guitar
LSTM-E (Alex): a man is playing a keyboard
LSTM-E (VGG+C3D): a man is playing a piano
Ground Truth:
1) a person is playing a piano keyboard
2) a man plays a keyboard
3) a boy played a keyboard

Video 6
LSTM: a man is riding a car
LSTM-E (Alex): a man is riding a bicycle
LSTM-E (VGG+C3D): a man is riding a motorcycle
Ground Truth:
1) someone is riding a motorcycle
2) a man is riding his motorcycle
3) a man is riding on a motor bike


Figure 5. Examples of sentence generation results. The videos are represented by sampled frames; the output sentences are generated by 1) LSTM and 2) our proposed LSTM-E (Alex) and LSTM-E (VGG+C3D); 3) Ground Truth: three randomly selected ground truth sentences.
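To make the N-gram overlap behind BLEU@N concrete, the sketch below computes a clipped N-gram precision for two of the generated sentences of the sixth video in Figure 5 against its three ground truth sentences. This is an illustrative simplification, not the evaluation code used in the paper: full BLEU also combines the precisions for several N and applies a brevity penalty, and the whitespace tokenization and function names here are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams found in the references.

    Each n-gram count is clipped by its maximum count in any single reference.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Ground truth and generated sentences for the sixth video in Figure 5.
references = [
    "someone is riding a motorcycle",
    "a man is riding his motorcycle",
    "a man is riding on a motor bike",
]
lstm_output = "a man is riding a car"
lstm_e_output = "a man is riding a motorcycle"

for n in (1, 2, 3, 4):
    print(n,
          clipped_precision(lstm_output, references, n),
          clipped_precision(lstm_e_output, references, n))
```

In this example the sentence with the correct object term scores higher at every N, and the gap is largest for the longest N-grams, because a single wrong word breaks every N-gram that contains it.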


5.3.2 The Size of Hidden Layer of LSTM

In order to show the relationship between the performance
and hidden layer size of LSTM, we compare the results for
hidden layer sizes of 128, 256, and 512. The
results shown in Table 3 indicate that increasing the hidden
layer size leads to improved performance with
respect to both BLEU@4 and METEOR. Therefore, in our
experiments, the hidden layer size is empirically set to 512,
which achieves the best performance.

6. Discussion and Conclusion 

In this paper, we have proposed a solution to the video
description problem by introducing a novel LSTM-E model
structure. In particular, a visual-semantic embedding space
is additionally incorporated into LSTM learning. In this
way, a global relationship between the video content and
sentence semantics is simultaneously measured in addition
to the local contextual relationship between the word at each
step and the previous ones in LSTM learning. On a popular
video description dataset, the results of our experiments
demonstrate the success of our approach, outperforming the
current state-of-the-art models by a significantly large margin
on both SVO prediction and sentence generation.

Our future work is as follows. First, as a video itself
is a temporal sequence, better ways of representing
the videos by using RNN will be further explored. Moreover,
video description generation might be significantly
boosted if we could have sufficient labeled video-sentence
pairs to train a deeper RNN.
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